Abstract



We construct a parametrization of the deep-inelastic structure function of the proton $F_2(x, Q^2)$
based on all available experimental information from charged lepton deep-inelastic scattering
experiments. The parametrization effectively provides a bias-free determination of the probability
measure in the space of structure functions, which retains information on experimental
errors and correlations. The result is obtained in the form of a Monte Carlo sample of neural
networks trained on an ensemble of replicas of the experimental data. We discuss in detail the
techniques required for the construction of bias-free parametrizations of large amounts of structure
function data, in view of future applications to the determination of parton distributions
based on the same method.
1 Introduction



The requirements of precision physics at hadron colliders have recently led to a rapid improvement
in the techniques for the determination of parton distributions of the nucleon, which are
mostly extracted from deep-inelastic structure functions [1]. Specifically, it is now mandatory
to determine accurately the uncertainty on these quantities. The main problem to be faced
here is that one is trying to determine an uncertainty on a function, i.e., a probability measure
on a space of functions, and to extract it from a finite set of experimental data. This problem
can be studied in a simpler context, namely, the determination from the pertinent data of a
structure function and its associated error. This sidesteps the technical complication of extracting
parton distributions from structure functions, but it does tackle the main issue, namely
the determination of an error on a function. Furthermore, the determination of a structure
function and associated error might be useful for a variety of applications, such as precision tests
of QCD (determination of $\alpha_s$ [2], tests of sum rules) or the determination of polarized structure
functions from asymmetry data [3].

A new approach to this problem was recently proposed in ref. [4], based on the use of
neural networks as basic interpolating tools. The main idea of this approach is to train a set
of neural networks on a set of Monte Carlo replicas of the experimental data which reproduces
their probability distribution. Hence, whereas the Monte Carlo replicas reproduce faithfully
the probability measure of the data for $F_2(x, Q^2)$ in the points of the $(x, Q^2)$ plane where data
are available, the neural networks provide an interpolation and extrapolation for all $(x, Q^2)$
subject to the only requirement of smoothness. The set of neural networks thus provides the
desired probability measure, at least in the measured $(x, Q^2)$ region, provided the sampling of
the $(x, Q^2)$ plane is not too coarse.

In ref. [4] a parametrization of the proton, deuteron and nonsinglet F2 structure functions 
based on the BCDMS and NMC fixed-target deep-inelastic scattering data was constructed 
in this way. Here, we extend the results of ref. [4] by constructing a parametrization of the 
proton F2 structure function which includes all available data, in particular the HERA collider 
data. Besides the obvious motivation of having state-of-the-art results for this quantity, the
main aim of this work is to develop a set of techniques which are required for the application 
of the method of ref. [4] to cases where the handling of a large number of disparate data sets 
is required. This involves in particular the use of genetic algorithms for the training of neural 
networks. 

In sect. 2 we summarize the features of the experimental data. In sect. 3 we review the 
fitting method of ref. [4], emphasizing the improvements which have been introduced here. In 
sect. 4 we discuss the details of the training of neural nets to the current data set. In sect. 5 
our final results are presented and compared to those previously obtained in ref. [4]. 

2 Experimental data 

We construct a parametrization of $F_2$ based on all available unpolarized charged lepton-proton
deep-inelastic scattering data [5]. However, we do not include early SLAC data, for which the
covariance matrix is not available, since they do not provide any extra kinematic coverage, and
are anyway less precise than later data. This leaves a total of 13 experiments, listed in table 1,
along with their main features. The coverage of the $(x, Q^2)$ kinematic plane afforded by these
data is shown in fig. 1.

Figure 1: Kinematic range of the experimental data.

Structure functions are defined by parametrizing the deep-inelastic neutral current scattering
cross section as



For the definition of kinematic variables see ref. [19]. We will construct a parametrization of
the structure function $F_2(x, Q^2)$, which provides the bulk of the contribution to eq. (1). For
all experiments the $F_L$ contribution to the cross section has already been subtracted by the
experimental collaborations, except for ZEUSBPC95, where we subtracted it using the values
published by the same experiment. Note that the structure function $F_2$ receives contributions
from both $\gamma$ and $Z$ exchange, though the $Z$ contribution is only non-negligible for the high-$Q^2$
datasets ZEUS94, H197, H199 and H100. We will construct a parametrization of the
structure function $F_2$ defined in eq. (1), i.e. containing all contributions. When the experimental
collaborations provide separately the contributions to $F_2$ due to $\gamma$ or $Z$ exchange we have
recombined them in order to get the full $F_2$ of eq. (1).

All the experiments included in our analysis provide full correlated systematics, as well as
normalization errors. The covariance matrix can be computed from these as

$$\mathrm{cov}_{ij} = \left( \sum_{k=1}^{N_{\mathrm{sys}}} \sigma_{i,k}\,\sigma_{j,k} + F_i F_j \sigma_N^2 \right) + \delta_{ij}\,\sigma_{i,t}^2 , \qquad (2)$$

where $F_i$, $F_j$ are central experimental values, $\sigma_{i,k}$ are the $N_{\mathrm{sys}}$ correlated systematics, $\sigma_N$ is
the total normalization uncertainty, and the uncorrelated uncertainty $\sigma_{i,t}$ is the sum of the
statistical uncertainty $\sigma_{i,s}$ and the $N_u$ uncorrelated systematic uncertainties (when present):

$$\sigma_{i,t}^2 = \sigma_{i,s}^2 + \sum_{k=1}^{N_u} \sigma_{i,k}^2 . \qquad (3)$$

The correlation matrix is then given by

$$\rho_{ij} = \frac{\mathrm{cov}_{ij}}{\sigma_{i,\mathrm{tot}}\,\sigma_{j,\mathrm{tot}}} , \qquad (4)$$

where the total error $\sigma_{i,\mathrm{tot}}$ for the $i$-th point is given by

$$\sigma_{i,\mathrm{tot}} = \sqrt{\sigma_{i,t}^2 + \sigma_{i,c}^2 + F_i^2 \sigma_N^2} , \qquad (5)$$

and the total correlated uncertainty $\sigma_{i,c}$ is the sum of all correlated systematics

$$\sigma_{i,c}^2 = \sum_{k=1}^{N_{\mathrm{sys}}} \sigma_{i,k}^2 . \qquad (6)$$

Table 1: Experiments included in this analysis. All values of $\sigma$ and cov are given as percentages.
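The construction of the covariance and correlation matrices from the individual error sources can be illustrated with a short sketch. It is written in plain Python to stay self-contained, the function and variable names are hypothetical, and the input numbers are made up for illustration rather than taken from any experiment.

```python
import math

def covariance_matrix(F, sys_err, sigma_N, sigma_t):
    """Eq. (2): correlated systematics + normalization + uncorrelated term.

    F        central values F_i
    sys_err  sys_err[i][k] = k-th correlated systematic of point i
    sigma_N  relative normalization uncertainty (common to all points)
    sigma_t  total uncorrelated uncertainty of each point, eq. (3)
    """
    n = len(F)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            corr = sum(a * b for a, b in zip(sys_err[i], sys_err[j]))
            cov[i][j] = corr + F[i] * F[j] * sigma_N ** 2
            if i == j:
                cov[i][j] += sigma_t[i] ** 2
    return cov

def correlation_matrix(cov):
    """Eq. (4); the diagonal of cov equals sigma_tot squared, eq. (5)."""
    n = len(cov)
    sig = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sig[i] * sig[j]) for j in range(n)] for i in range(n)]

# Toy input: two points, one correlated systematic, 2% normalization error.
F = [0.40, 0.35]
sys_err = [[0.004], [0.003]]
sigma_t = [0.010, 0.012]
cov = covariance_matrix(F, sys_err, sigma_N=0.02, sigma_t=sigma_t)
rho = correlation_matrix(cov)
```

In this toy case the off-diagonal covariance comes entirely from the shared systematic and the common normalization, which is why the two points are positively correlated.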

For the ZEUS94, ZEUSSVX95 and ZEUSBPT97 experiments some uncertainties are asymmetric.
As is well known [20-22], asymmetric errors cannot be combined in a simple multigaussian
framework, and in particular they cannot be added to gaussian errors in quadrature. In the
treatment of multigaussian errors, we will follow the approach of ref. [21,22], which, on top of
several theoretical advantages, is closest to the ZEUS error analysis and thus adequate for a
faithful reproduction of the ZEUS data. In this approach, a data point with central value $x_0$
and left and right asymmetric uncertainties $\sigma_L$ and $\sigma_R$ (not necessarily positive) is described
by a symmetric gaussian distribution, centered at

$$\langle x \rangle = x_0 + \frac{\sigma_R - \sigma_L}{2} \qquad (7)$$

and with width

$$\sigma_x^2 = \Delta^2 = \left( \frac{\sigma_R + \sigma_L}{2} \right)^2 . \qquad (8)$$

The ensuing distribution can then be treated in the standard gaussian way.
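As a small worked example of this symmetrization, the sketch below assumes that the shift of the central value is half the difference of the two uncertainties and that the width is their average, which is our reading of the prescription of refs. [21,22]. The function name and the numbers are hypothetical.

```python
def symmetrize(x0, sigma_L, sigma_R):
    """Replace x0 (+sigma_R, -sigma_L) by a symmetric gaussian.

    Returns (mean, width): the mean is shifted by half the asymmetry and
    the width is the average of the two one-sided uncertainties.
    """
    mean = x0 + (sigma_R - sigma_L) / 2.0
    width = (sigma_R + sigma_L) / 2.0
    return mean, width

# A point quoted as 0.50 +0.04 -0.02 (made-up numbers).
mean, width = symmetrize(0.50, sigma_L=0.02, sigma_R=0.04)
```

The resulting pair can then enter the covariance matrix like any other gaussian error source.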
3 Fitting technique



The construction of a parametrization of $F_2(x, Q^2)$ according to the method of ref. [4] consists
of two steps: generation of a set of Monte Carlo replicas of the original data, and training of a
neural network to each replica. We summarize here the main features of these two steps, and
the improvements that we introduced over the methods of ref. [4].

The Monte Carlo replicas of the original experiment are generated as a multigaussian distribution:
each replica is given by a set of values

$$F_i^{(\mathrm{art})(k)} = \left( 1 + r_N^{(k)} \sigma_N \right) \left[ F_i^{(\mathrm{exp})} + \sum_{p=1}^{N_{\mathrm{sys}}} r_p^{(k)} \sigma_{i,p} + r_i^{(k)} \sigma_{i,t} \right] , \qquad k = 1, \ldots, N_{\mathrm{rep}} , \qquad (9)$$

where $F_i^{(\mathrm{exp})}$ is the $i$-th data point, we introduce an independent univariate gaussian random
number $r^{(k)}$ for each independent error source, and the various errors are defined in eqs. (2)-(6).
The value of $N_{\mathrm{rep}}$ is determined in such a way that the Monte Carlo set of replicas models
faithfully the probability distribution of the data in the original set. A comparison of expectation
values, variance and correlation of the Monte Carlo set with the corresponding input
experimental values as a function of the number of replicas is shown in fig. 2, where we display
scatter plots of the central values and errors for samples of 10, 100 and 1000 replicas. The
corresponding plot for correlations is essentially the same as that shown in ref. [4]. A more
quantitative comparison can be performed by defining suitable statistical estimators (see the
appendix). Results are presented in table 2. Note in particular the scatter correlations $r$ for
central values, errors and correlations, which indicate the size of the spread of data around a
straight line. The table shows that a sample of 1000 replicas is sufficient to ensure average
scatter correlations of 99% and accuracies of a few percent on structure functions, errors and
correlations.
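The replica generation can be sketched as follows, assuming the multiplicative treatment of the normalization and one shared gaussian draw per correlated error source. This is an illustration in plain Python with hypothetical names and made-up inputs, not the code used for the fit.

```python
import random

def make_replica(F_exp, sys_err, sigma_t, sigma_N, rng):
    """Generate one Monte Carlo replica of the data.

    One gaussian number is shared by all points for the normalization,
    one for each correlated systematic, and an independent one is drawn
    per point for the uncorrelated uncertainty sigma_t.
    """
    n_sys = len(sys_err[0])
    r_N = rng.gauss(0.0, 1.0)
    r_sys = [rng.gauss(0.0, 1.0) for _ in range(n_sys)]
    replica = []
    for F, s_row, s_t in zip(F_exp, sys_err, sigma_t):
        shift = sum(r * s for r, s in zip(r_sys, s_row))
        shift += rng.gauss(0.0, 1.0) * s_t
        replica.append((1.0 + r_N * sigma_N) * (F + shift))
    return replica

rng = random.Random(0)
F_exp = [0.40, 0.35]            # toy central values
sys_err = [[0.004], [0.003]]    # one correlated systematic
sigma_t = [0.010, 0.012]        # uncorrelated errors
replicas = [make_replica(F_exp, sys_err, sigma_t, 0.02, rng) for _ in range(1000)]

# Average over replicas: should approach the experimental central values.
means = [sum(rep[i] for rep in replicas) / len(replicas) for i in range(len(F_exp))]
```

Comparing `means`, and analogously the sample variances and correlations, with the input values is the kind of check summarized by the estimators of table 2.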

$N_{\mathrm{rep}}$ neural networks [23] are then trained on the Monte Carlo data, by training each neural
network on all the $F_i^{(\mathrm{art})(k)}$ data points in the $k$-th replica. The architecture of the networks
is the same as in ref. [4]. The training is subdivided in three epochs, each based on the
minimization of a different error function. In the first training epoch, the networks are trained
to minimize the function

$$E_1^{(k)} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \left( F_i^{(\mathrm{art})(k)} - F_i^{(\mathrm{net})(k)} \right)^2 , \qquad (10)$$

i.e., the deviation from the central value per data point. In the second epoch the function to be
minimized is the uncorrelated $\chi^2$ per data point, namely, the $\chi^2$ computed omitting correlated
systematics:

$$E_2^{(k)} = \chi^{2\,(k)}_{\mathrm{diag}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \frac{\left( F_i^{(\mathrm{art})(k)} - F_i^{(\mathrm{net})(k)} \right)^2}{\sigma_{i,t}^{(\mathrm{exp})\,2}} , \qquad (11)$$

where $\sigma_{i,t}^{(\mathrm{exp})}$ is defined in eq. (3). Finally, in the third epoch the full $\chi^2$ per data point is
minimized, namely

$$E_3^{(k)} = \frac{1}{N_{\mathrm{dat}}} \sum_{i,j=1}^{N_{\mathrm{dat}}} \left( F_i^{(\mathrm{art})(k)} - F_i^{(\mathrm{net})(k)} \right) \left( \mathrm{cov}^{(k)} \right)^{-1}_{ij} \left( F_j^{(\mathrm{art})(k)} - F_j^{(\mathrm{net})(k)} \right) , \qquad (12)$$

where $\mathrm{cov}^{(k)}_{ij}$ is the covariance matrix for the $k$-th replica, defined as in eq. (2), but with the
normalization uncertainty included as an overall rescaling of the error due to the normalization
offset of that replica: namely,

$$\mathrm{cov}^{(k)}_{ij} = \left( \sum_{p=1}^{N_{\mathrm{sys}}} \bar\sigma^{(k)}_{i,p}\, \bar\sigma^{(k)}_{j,p} \right) + \delta_{ij}\, \bar\sigma^{(k)\,2}_{i,t} , \qquad (13)$$

with

$$\bar\sigma^{(k)}_{i,a} = \left( 1 + r_N^{(k)} \sigma_N \right) \sigma_{i,a} , \qquad a = p,\, t , \qquad (14)$$

where $r_N^{(k)}$ is as in eq. (9). This is necessary in order to avoid a biased treatment of the
normalization errors [22,24].

Table 2: Comparison between experimental and Monte Carlo data.
The experimental data have $\langle \sigma^{(\mathrm{exp})} \rangle_{\mathrm{dat}} = 0.0311$, $\langle \rho^{(\mathrm{exp})} \rangle_{\mathrm{dat}} = 0.2914$ and $\langle \mathrm{cov}^{(\mathrm{exp})} \rangle_{\mathrm{dat}} = 0.00015$.
All statistical indicators are defined in the appendix.
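The three error functions can be written compactly as in the following sketch. It assumes the per-data-point normalization described above. The small linear solver is included only to keep the example self-contained, and all names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def E1(art, net):
    """Mean squared deviation per data point (first epoch)."""
    return sum((a - f) ** 2 for a, f in zip(art, net)) / len(art)

def E2(art, net, sigma_t):
    """Uncorrelated chi^2 per data point (second epoch)."""
    return sum(((a - f) / s) ** 2 for a, f, s in zip(art, net, sigma_t)) / len(art)

def E3(art, net, cov):
    """Full chi^2 per data point with the covariance matrix (third epoch)."""
    d = [a - f for a, f in zip(art, net)]
    return sum(di * yi for di, yi in zip(d, solve(cov, d))) / len(d)

# Toy check: with a diagonal covariance matrix E3 reduces to E2.
art, net = [0.41, 0.33], [0.40, 0.35]
sigma_t = [0.01, 0.02]
cov_diag = [[sigma_t[0] ** 2, 0.0], [0.0, sigma_t[1] ** 2]]
e1, e2, e3 = E1(art, net), E2(art, net, sigma_t), E3(art, net, cov_diag)
```

The sketch also makes the nonlocality of $E_3$ explicit: unlike $E_1$ and $E_2$, it cannot be written as a sum of independent single-point terms, since every residual is coupled to all the others through the inverse covariance matrix.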

The rationale behind this three-step procedure is that the true minimum which the fitting
procedure must determine is that of the full $\chi^2$ eq. (12). However, this is nonlocal and time
consuming to compute. It is therefore advantageous to look for its rough features at first, then
refine its search, and finally determine its actual location.

The minimum during the first two epochs is found using back-propagation (BP) (see ref. [4]).
This method is not suitable for the minimization of the nonlocal function eq. (12). In ref. [4]
BP was used throughout, and the third epoch was omitted. This is acceptable provided the
total systematics is small in comparison to the statistical errors, and indeed it was verified
that a good approximation to the minimum of eq. (12) could be obtained from the ensemble of
neural networks. This is no longer the case for the present extended data set, as we shall see
explicitly in section 5. Therefore, the full $\chi^2$ eq. (12) is minimized in the third training epoch
by means of genetic algorithms (GA), previously used and discussed by some of us for related
purposes in ref. [25].

In comparison to previous work [4,25], we have implemented several improvements, both
in the BP and GA training epochs. In the BP epoch, we use on-line training as in ref. [4],
i.e. the parameters of the neural network are updated after each data point has been shown
to it. This defines a training cycle. In ref. [4] it was shown that the length of training needed
to achieve a given value of $E_3$ can differ significantly between experiments: it is larger for
experiments which have smaller errors, contain more data, or cover kinematic regions where
the structure function varies more rapidly. If one wants to end up with a similar value of $\chi^2$
for all experiments it is then necessary to adjust the relative length of training of different data
sets. In ref. [4] this was achieved by finding by trial and error an optimal fixed weight for the
two experiments included in the fit. This procedure is clearly not viable when the number of
experiments is large. Therefore, we have implemented a dynamical weighted training, whereby
the weight $p_i$ of each experiment is chosen initially to be the same for all experiments, and
then adjusted dynamically according to the relative contribution of each experiment $E_{3,i}$ to the
total $E_3$ eq. (12):

$$p_i = \frac{E_{3,i}}{\sum_{j=1}^{N_{\mathrm{exp}}} E_{3,j}} . \qquad (15)$$

The value of $E_{3,i}$ is updated from the full data set every $1.25 \times 10^6$ training cycles; because
there are ~1700 data points, this ensures that between updates each data point has been seen
by the net about 700 times on average in the unweighted case. This method is not viable in
the third (GA) training epoch, where $E_3$ can only be computed from the full set of data points
(i.e., the training is necessarily batched, and not on-line). Therefore, one cannot choose to
show a subset of data more often. One could in principle reweight the contribution of the single
experiments to $E_3$, but this might distort the global minimum in an unpredictable way.
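A minimal sketch of the dynamical weighting is given below. It assumes that the weight of each experiment is its share of the summed error function, and that the weight is used as the probability of drawing the next on-line training point from that experiment. The second assumption is ours, made only to have a concrete illustration. Experiment names and error values are hypothetical.

```python
import random

def update_weights(E_by_exp):
    """Weight of each experiment = its relative contribution to the total."""
    total = sum(E_by_exp.values())
    return {name: E / total for name, E in E_by_exp.items()}

def pick_experiment(weights, rng):
    """Choose the experiment that supplies the next training point."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical per-experiment error values at some stage of the training.
E_by_exp = {"A": 3.0, "B": 1.0, "C": 1.0}
weights = update_weights(E_by_exp)

rng = random.Random(1)
counts = {name: 0 for name in weights}
for _ in range(10000):
    counts[pick_experiment(weights, rng)] += 1
```

Experiments that are currently fitted worse are shown to the network more often, and the weights would be recomputed from the full data set at regular intervals during the training.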

In the GA epoch, we have introduced two improvements in comparison to the methods
described in ref. [25]. First, we have introduced multiple mutations, specifically three nested
mutations for each training cycle (note that GA training cycles are referred to as generations in
ref. [25], as is customary for genetic algorithms). The purpose of this is to avoid local minima,
thereby increasing the speed of training. It is crucial that rates for these additional mutations are large,
in order to allow for jumps from a local minimum to a deeper one. We find that one additional
mutation with probability 20%, followed by two additional mutations with probability
4%, produces a significant improvement of the convergence rate. Second, we have introduced
probabilistic selection. This entails that the sample of $N_{\mathrm{sel}}$ selected mutations is constructed
by always selecting the mutation $z_0$ with the lowest $E_3$ value, plus $N_{\mathrm{sel}} - 1$ mutations selected
among the total $N_{\mathrm{mut}}$ mutations with probability

P, = e.p(^-^M_^y (16) 

Namely, mutations with larger $E_3$ values are less likely to be selected but can still be selected
with finite probability. This is helpful in allowing for mutations which only become beneficial
after a combination of several individual mutations.
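Probabilistic selection can be illustrated with the sketch below. The acceptance probability used here, an exponential of the difference to the best $E_3$ divided by a "temperature", is an assumed form chosen only for illustration, and the `temperature` parameter is hypothetical. It shares with the text only the qualitative feature that worse mutations are less likely, but not impossible, to be kept.

```python
import math
import random

def select_mutations(E_values, n_sel, temperature, rng):
    """Keep the best mutation, then fill n_sel - 1 slots probabilistically.

    A randomly chosen candidate z is accepted with probability
    exp(-(E[z] - E[best]) / temperature), an assumed functional form.
    Returns the indices of the selected mutations.
    """
    order = sorted(range(len(E_values)), key=lambda z: E_values[z])
    best = order[0]
    selected = [best]
    candidates = order[1:]
    while len(selected) < n_sel and candidates:
        z = rng.choice(candidates)
        p = math.exp(-(E_values[z] - E_values[best]) / temperature)
        if rng.random() < p:
            selected.append(z)
            candidates.remove(z)
    return selected

rng = random.Random(2)
E_values = [1.30, 1.10, 1.25, 1.90, 1.12]   # toy E3 values of five mutations
selected = select_mutations(E_values, n_sel=3, temperature=0.2, rng=rng)
```

A purely deterministic selection would always keep the three lowest values. Here a mutation with a larger $E_3$ occasionally survives, which is what allows sequences of individually unfavourable mutations to be explored.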

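At the end of the GA training we are left with a sample of $N_{\mathrm{rep}}$ neural networks, from which
e.g. the value of the structure function at $(x, Q^2)$ can be computed as

$$F_2(x, Q^2) = \frac{1}{N_{\mathrm{rep}}} \sum_{k=1}^{N_{\mathrm{rep}}} F_2^{(\mathrm{net})(k)}(x, Q^2) . \qquad (17)$$

The goodness of fit of the final set is thus measured by the $\chi^2$ per data point, which, given the
large number of data points, is essentially identical to the $\chi^2$ per degree of freedom:

$$\chi^2 = \frac{1}{N_{\mathrm{dat}}} \sum_{i,j=1}^{N_{\mathrm{dat}}} \left( F_i^{(\mathrm{exp})} - \langle F_i^{(\mathrm{net})} \rangle_{\mathrm{rep}} \right) \left( \mathrm{cov}^{-1} \right)_{ij} \left( F_j^{(\mathrm{exp})} - \langle F_j^{(\mathrm{net})} \rangle_{\mathrm{rep}} \right) , \qquad (18)$$

where the average over replicas, denoted by $\langle\,\rangle_{\mathrm{rep}}$, is defined in the appendix.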
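In practice, the central value and its uncertainty at a given $(x, Q^2)$ follow from simple statistics over the outputs of the individual networks. The sketch below assumes that the one-sigma error is the standard deviation of the sample, with a $1/N_{\mathrm{rep}}$ normalization (the precise definition is that of the appendix). The list of predictions is made up.

```python
import math

def replica_mean_and_error(predictions):
    """Mean and standard deviation over the network outputs.

    predictions[k] stands for the output of the k-th network at a fixed
    (x, Q2). The 1/N normalization of the variance is an assumption.
    """
    n = len(predictions)
    mean = sum(predictions) / n
    var = sum((p - mean) ** 2 for p in predictions) / n
    return mean, math.sqrt(var)

# Three toy network outputs at the same kinematic point.
central, error = replica_mean_and_error([0.38, 0.40, 0.42])
```

Any other functional of the structure function (correlations between two points, moments, and so on) can be obtained in the same way, by evaluating it on each network and averaging.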



4 Training to the F2 data 

In order to apply the general method discussed in sect. 3 to the data presented in sect. 2 
several specific issues must be addressed: the choice of training parameters and training length, 
the choice of the actual data set, and the choice of theoretical constraints. We now address 
these issues in turn. 

The parameters and length for the first two training epochs have been determined by inspection
of the fit of a neural network to the central experimental values. Clearly, this choice
is less critical, in that it is only important in order for the training to be reasonably fast, but
it does not impact on the final result. We choose for the first BP epoch $5 \times 10^{?}$ training cycles
with learning rate $\eta = 9 \times 10^{-?}$ and momentum term $\alpha = 0.9$, and for the second BP epoch
$2.5 \times 10^{?}$ training cycles with learning rate $\eta = 9 \times 10^{-?}$ and momentum term $\alpha = 0.9$.

After these first two training epochs, the diagonal $\chi^2_{\mathrm{diag}} = E_2$ per data point eq. (11), which
is being minimized, is of order two for the central data set. This is comparable to the length of
training that was required to reach $\chi^2_{\mathrm{diag}} \sim$ 1-3 for the smaller data set of ref. [4]. The value of
$E_3$ eq. (12), which is always bounded by it, $E_3 < E_2$, is accordingly smaller (see table 3). The
training algorithm then switches to GA minimization of the $E_3$ eq. (12). The determination of
the length of this training epoch is critical, in that it controls the form of the final fit. This can
only be done by looking at the features of the full Monte Carlo sample of neural networks.

Table 3: The uncorrelated $\chi^2$ $E_2$ eq. (11) and the total $\chi^2$ $E_3$ eq. (12) for the fit to the central
data points: (A) after the backpropagation training epoch and (B) after the final genetic algorithms
training epoch.



Before addressing this issue, however, it turns out to be necessary to consider the possibility
of introducing cuts in the data set. Indeed, consider the results obtained after a GA training of
$4 \times 10^{?}$ cycles (with mutation rate $5 \times 10^{-?}$) to the central data set, displayed in table 3. This
is a rather long training: indeed, in each GA training cycle all the data are shown to the nets.
Hence in $4 \times 10^{?}$ GA cycles the data are shown to the nets $0.7 \times 10^{?}$ times, comparable to the
number of times they are shown to the nets during BP training. It is apparent that whereas
$E_3 \sim 1$ for most experiments, it remains abnormally high for NMC and especially ZEUS94 and
ZEUSSVX95. Because of the weighted training which has been adopted, this is unlikely to be
due to insufficient training of these data sets, and is more likely related to problems of these
data sets.

Whereas ZEUSSVX95 only contains a small number of data points, NMC and ZEUS94
account each for more than 10% of the total number of data points, and thus they can bias
final results considerably. The case of NMC was discussed in detail in ref. [4]. This data set is
the only one to cover the medium-$x$, medium-small $Q^2$ region (compare figure 1) and thus it
cannot be excluded from the fit. As discussed in ref. [4], the relatively large value of $E_3$ for this
experiment is a consequence of internal inconsistencies within the data set. A value of $E_3 \approx 1.5$
indicates that the neural nets do not reproduce the subset of data which are incompatible with
the bulk, as it should be, whereas a value $E_3 \approx 1$ could only be obtained by overlearning, i.e.
essentially by fitting irrelevant fluctuations (see ref. [4]).

Let us now consider the case of ZEUS94. The kinematic region of this experiment is entirely
covered by ZEUS97, H197, H199, H100. We can therefore study the impact of excluding this
experiment from the global fit, without information loss. The results obtained in such case
are displayed in table 4: when the experiment is not fitted the $E_3$ value for all experiments
with which it overlaps improves and so does the global $E_3$, whereas $E_3$ for ZEUS94 itself only
deteriorates by a comparable amount, despite the fact that the experiment is now not fitted
at all. We conclude that the experiment should be excluded from the fit, since its inclusion
results in a deterioration of the fit quality, whereas its exclusion does not entail information
loss. Difficulties in the inclusion of this experiment in global fits were already pointed out in
refs. [26,27], where it was suggested that they may be due to underestimated or nongaussian
uncertainties. It is likely that ZEUSSVX95 has similar problems. However, its inclusion in
the fit is no cause for concern, even if its high $E_3$ value were due to incompatibility of this
experiment with the others or underestimate of its experimental uncertainties, because of the
small number of data points. It is therefore retained in the data set. Our final data set thus
includes all experiments in table 1, except ZEUS94. We are thus left with 1698 data points.

Table 4: The same fit as the last column of table 3 if the ZEUS94 data are excluded from the fit.

For the sake of future applications, it is interesting to ask how the procedure of selecting
experiments in the data set can be automated. This can be done in an iterative way as follows:
first, a neural net (or sample of neural nets) is trained on only one experiment; then, the total
$E_3$ for the full data set is computed using this neural net (or sample of nets); the procedure
is then repeated for all experiments, and the experiment which leads to the smallest total $E_3$
is selected. In the second iteration, the net (or sample of nets) is trained on the experiment
selected in the first iteration plus any of the other experiments, thereby leading to the selection
of a second experiment to be added to that selected previously, and so on. The process can be
terminated before all experiments are selected, for instance if it is seen that the addition of a
new experiment does not lead to a significant improvement in $E_3$ for given length of training.
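The iterative selection just described is a greedy forward selection, and its control flow can be sketched as follows. The callable `total_E3` is a hypothetical stand-in for "train on this subset of experiments, then evaluate $E_3$ on the full data set". In the toy example it is replaced by an arbitrary function so that the sketch can run.

```python
def greedy_selection(experiments, total_E3, min_gain=0.0):
    """Add experiments one at a time, each time the one giving the lowest
    total E3 on the full data set; stop when the gain is not significant.

    experiments  list of experiment names
    total_E3     callable: list of names -> total E3 (user supplied)
    min_gain     minimum improvement required to accept a new experiment
    """
    selected, remaining = [], list(experiments)
    best_so_far = float("inf")
    while remaining:
        scores = {e: total_E3(selected + [e]) for e in remaining}
        best = min(scores, key=scores.get)
        if best_so_far - scores[best] <= min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        best_so_far = scores[best]
    return selected

# Toy stand-in: each experiment changes the total E3 by a fixed amount;
# experiment "C" makes the fit worse, mimicking an inconsistent data set.
quality = {"A": 1.0, "B": 0.5, "C": -0.3}

def toy_total_E3(subset):
    return 3.0 - sum(quality[e] for e in subset)

chosen = greedy_selection(["A", "B", "C"], toy_total_E3)
```

With a real training in place of the toy function, each call is expensive, so the number of evaluations (at most of order the square of the number of experiments) is the practical limitation of the procedure.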

We now proceed to discuss the length of training for our final data set. The $\chi^2$ eq. (18) is
shown in figure 3 as a function of the number of GA training cycles. The $\chi^2$ decreases very
rapidly during the first few hundreds of training cycles. After about 5000 training cycles, the
$\chi^2$ as a function of the training length essentially flattens for all experiments but BCDMS. The
further decrease of the total $\chi^2$ is then due essentially to the decrease of the contribution from
BCDMS. A training length of $4 \times 10^{?}$ GA cycles is necessary in order for the $\chi^2$ of BCDMS
to flatten out at $\chi^2 \sim 1.2$. As discussed in ref. [4], the BCDMS data can only be learnt with
a longer training because they have high precision while being located in the intermediate $x$
(valence) region, where the parton distribution displays significant variation.

Figure 3: Dependence of the $\chi^2$ eq. (18) on the length of training: (left) total training (right) detail
of the GA training.

The $E_3$ values for the fit of a neural net to the central data with this training are given in
table 3. It shows that all experiments are well reproduced with the exceptions discussed above.
It is interesting to observe that while $E_3$ eq. (12) decreases significantly during the GA training,
the uncorrelated $E_2$ eq. (11) decreases marginally, and in fact it actually increases for several
HERA experiments. This shows that correlations are sizable for the HERA experiments, so
that the approach of ref. [4], based on the minimization of $\chi^2_{\mathrm{diag}} = E_2$ eq. (11), is not adequate
in this case. GA minimization appears to be very efficient in reducing the $E_3$ value relatively
fast.

We finally turn to the issue of theoretical constraints. The only theoretical assumption
on the shape of $F_2(x, Q^2)$ is that it satisfies the kinematic constraint $F_2(1, Q^2) = 0$ for all
$Q^2$. As this constraint is local, its implementation is straightforward: it can be enforced by
including in the data set a number of artificial data points which satisfy the constraint with a
suitably tuned error. In the present fit we have checked that the best choice is to add a number
of artificial points at $x = 1$, equal to 2% of the experimental trained points (33 points with
ZEUS94 excluded from the fits), and with error equal to one tenth of the mean statistical error
of the trained points. These points are equally spaced in $\ln Q^2$, within the range covered by
the trained points.
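The construction of the artificial points can be sketched as below. The number of trained points, the 2% fraction and the factor of one tenth are taken from the text, while the $Q^2$ range and the mean statistical error used in the example call are placeholders, and all names are hypothetical.

```python
import math

def constraint_points(n_trained, q2_min, q2_max, mean_stat_error, fraction=0.02):
    """Artificial data enforcing F2(x=1, Q2) = 0.

    The points sit at x = 1, are equally spaced in ln Q2 between q2_min
    and q2_max, and carry an error of one tenth of the mean statistical
    error of the trained points.
    """
    n_art = int(fraction * n_trained)
    lo, hi = math.log(q2_min), math.log(q2_max)
    step = (hi - lo) / (n_art - 1) if n_art > 1 else 0.0
    sigma = mean_stat_error / 10.0
    return [
        {"x": 1.0, "Q2": math.exp(lo + i * step), "F2": 0.0, "sigma": sigma}
        for i in range(n_art)
    ]

# 1698 trained points give 33 artificial points; range and error are placeholders.
points = constraint_points(1698, q2_min=1.0, q2_max=1.0e4, mean_stat_error=0.02)
```

Because the constraint enters as ordinary data points, it needs no change to the training algorithm. Its strength is controlled only by the number of points and by their error.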

5 Results 

The result of the minimization of a single neural net to the central data points is shown in
table 4. The results for the final set of 1000 neural networks are displayed in table 5, while
in table 6 we give the details of results for each experiment. Note that the figure of merit for
the minimization eq. (12) and its average defined in the appendix eq. (30) differ from the
full $\chi^2$ eq. (18) not only because the latter is computed from the structure function averaged
over nets eq. (17), but also because of the different treatment of normalization errors in the
respective covariance matrices eq. (13) and eq. (2).

$N_{rep}$                                     1000
$\chi^2$                                      1.18
$\langle E \rangle$                           2.52
$r[F^{(net)}]$                                0.99
$\mathcal{R}$                                 0.54
$\langle V[\sigma^{(net)}]\rangle_{dat}$      1.2 10^-3
$\langle PE[\sigma^{(net)}]\rangle_{dat}$     80%
$\langle\sigma^{(exp)}\rangle_{dat}$          0.027
$\langle\sigma^{(net)}\rangle_{dat}$          0.008
$r[\sigma^{(net)}]$                           0.73
$\langle V[\rho^{(net)}]\rangle_{dat}$        0.20
$\langle\rho^{(exp)}\rangle_{dat}$            0.31
$\langle\rho^{(net)}\rangle_{dat}$            0.67
$r[\rho^{(net)}]$                             0.54
$\langle V[cov^{(net)}]\rangle_{dat}$         3.3 10^-?
$\langle cov^{(exp)}\rangle_{dat}$            1.3 10^-?
$\langle cov^{(net)}\rangle_{dat}$            3.6 10^-5
$r[cov^{(net)}]$                              0.49

Table 5: Estimators of the final results.

Besides the $\chi^2$ we also list the values of various quantities, defined in the appendix, which
can be used to assess the goodness of fit.

The quality of the final fit is somewhat better than that of the fit to the central data points
shown in table 4. In particular, with the exception of NMC (which is likely to have internal
inconsistencies [4]) and ZEUSSVX95 (which is likely to have the same problems as those of
ZEUS94 discussed in section 4) the $\chi^2$ per degree of freedom is of order 1 for all experiments.
It is interesting to note that the $\chi^2$ for the neural network average is rather better than the
average $\langle E \rangle$ eq. (30). The (scatter) correlation between experimental data and the neural
network prediction equals one to about 1% accuracy, with the exception of NMC, ZEUSSVX95
(which have the aforementioned problems) and E665. The E665 kinematic region overlaps
almost entirely (apart from very small $Q^2 < 1$ GeV$^2$) with that of NMC and BCDMS, while
having lower accuracy (this is why the experiment was not included in the fits of ref. [4]). The
data points corresponding to this experiment are therefore essentially predicted by the fit to
other experiments, thus explaining the somewhat smaller scatter correlation.

The average neural network variance is in general substantially smaller than the average
experimental error, typically by a factor 3-4. This is the reason why $\langle E \rangle > \chi^2$: the neural
nets fluctuate less about central experimental values than the Monte Carlo replicas. In the
presence of substantial error reduction, the (scatter) correlation between network covariance
and experimental error is generally not very high, and can take low values when a small number
of data points from one experiment is enough to determine the outcome of the fit, such as in
the case of the NMC experiment, even more so for E665 [4].

Experiment                                    NMC         BCDMS       E665
$\chi^2$                                      1.47        1.19        1.20
$\langle E \rangle$                           2.69        3.17        2.29
$r[F^{(net)}]$                                0.96        0.99        0.91
$\mathcal{R}$                                 0.59        0.50        0.56
$\langle V[\sigma^{(net)}]\rangle_{dat}$      0.002       1.9 10^-?   0.0013
$\langle PE[\sigma^{(net)}]\rangle_{dat}$     0.63        0.56        0.89
$\langle\sigma^{(exp)}\rangle_{dat}$          0.017       0.007       0.032
$\langle\sigma^{(net)}\rangle_{dat}$          0.008       0.005       0.008
$r[\sigma^{(net)}]$                           0.23        0.98        0.17
$\langle V[\rho^{(net)}]\rangle_{dat}$        0.51        0.69        0.29
$\langle\rho^{(exp)}\rangle_{dat}$            0.17        0.52        0.20
$\langle\rho^{(net)}\rangle_{dat}$            0.84        0.86        0.60
$r[\rho^{(net)}]$                             0.08        0.73        0.05
$\langle V[cov^{(net)}]\rangle_{dat}$         4 10^-?     1.8 10^-?   4.5 10^-?
$\langle cov^{(exp)}\rangle_{dat}$            4 10^-?     3.8 10^-?   1.7 10^-?
$\langle cov^{(net)}\rangle_{dat}$            2 10^-?     2.3 10^-?   3.3 10^-?
$r[cov^{(net)}]$                              -0.03       0.98        0.16

Experiment                                    ZEUSBPC95   ZEUSSVX95   ZEUS97      ZEUSBPT97   H1SVX95     H197        H1LX97      H199        H100
$\chi^2$                                      1.02        2.08        1.35        0.86        0.67        0.71        1.07        0.90        1.11
$\langle E \rangle$                           2.07        2.03        2.24        2.08        2.03        1.91        2.41        1.93        2.11
$r[F^{(net)}]$                                0.98        0.96        0.99        0.99        0.97        0.99        0.99        0.98        0.99
$\mathcal{R}$                                 0.51        0.66        0.55        0.55        0.44        0.46        0.53        0.48        0.54
$\langle V[\sigma^{(net)}]\rangle_{dat}$      4.3 10^-?   0.0035      0.0010      1.3 10^-?   0.0043      0.0030      0.0005      0.003       0.0013
$\langle PE[\sigma^{(net)}]\rangle_{dat}$     0.91        0.94        0.87        0.72        0.96        0.95        0.75        0.96        0.93
$\langle\sigma^{(exp)}\rangle_{dat}$          0.022       0.061       0.037       0.012       0.063       0.040       0.027       0.051       0.030
$\langle\sigma^{(net)}\rangle_{dat}$          0.006       0.013       0.011       0.006       0.011       0.008       0.008       0.008       0.009
$r[\sigma^{(net)}]$                           0.85        0.72        0.86        0.73        0.84        0.87        0.42        0.82        0.89
$\langle V[\rho^{(net)}]\rangle_{dat}$        0.09        0.30        0.12        0.14        0.118       0.14        0.31        0.16        0.14
$\langle\rho^{(exp)}\rangle_{dat}$            0.61        0.24        0.28        0.40        0.36        0.06        0.29        0.05        0.09
$\langle\rho^{(net)}\rangle_{dat}$            0.77        0.64        0.39        0.63        0.57        0.27        0.58        0.29        0.26
$r[\rho^{(net)}]$                             0.53        0.40        0.66        0.60        0.48        0.51        0.69        0.37        0.55
$\langle V[cov^{(net)}]\rangle_{dat}$         6.4 10^-?   1.9 10^-?   3.4 10^-?   1.4 10^-?   3.0 10^-?   3.8 10^-?   3.8 10^-?   2.7 10^-?   1.7 10^-?
$\langle cov^{(exp)}\rangle_{dat}$            2.8 10^-?   8.5 10^-?   3.7 10^-?   5.8 10^-?   0.0014      1.0 10^-?   2.1 10^-?   1.4 10^-?   9.6 10^-?
$\langle cov^{(net)}\rangle_{dat}$            2.8 10^-?   1.2 10^-?   3.2 10^-?   2.3 10^-?   7.0 10^-?   1.5 10^-?   6.9 10^-?   1.6 10^-?   2.2 10^-?
$r[cov^{(net)}]$                              0.69        0.48        0.77        0.65        0.53        0.61        0.57        0.54        0.58

Table 6: Final results for the individual experiments: fixed target (top) and HERA (bottom)

As discussed extensively in ref. [4] it is important to make sure that this is due to the fact
that information from individual data points is combined through an underlying law by the
neural networks, and not due to parametrization bias. To this purpose, the $\mathcal{R}$-estimator has
been introduced in ref. [4], where it was shown that in the presence of substantial error reduction
$\mathcal{R} > 1$ if there is parametrization bias, whereas $\mathcal{R} \sim 0.5$ in the absence of parametrization bias.^2
It is apparent from tables 5-6 that indeed $\mathcal{R} \sim 0.5$ for all experiments. Note that, contrary to
what was found in ref. [4], there is now some error reduction also for the BCDMS experiment,
though by a somewhat smaller amount than for other experiments. We will come back to this
issue when comparing results to those of ref. [4].

^2 Note that in ref. [4] the $\mathcal{R}$-ratio was defined in terms of the diagonal $\chi^2_{\mathrm{diag}}$ eq. (11), because neural networks
were trained by minimizing that quantity. It is easy to see that the results of ref. [4] for $\mathcal{R}$ remain true when
the full $E_3$ eq. (12) is minimized, provided $\mathcal{R}$ is redefined accordingly.

Figure 4: Dependence of $\langle \sigma^{(\mathrm{net})} \rangle_{\mathrm{dat}}$ on the length of training for the BCDMS experiment.

Further evidence that the error reduction is not due to parametrization bias can be obtained
by studying the dependence of $\langle \sigma^{(\mathrm{net})} \rangle_{\mathrm{dat}}$ on the length of training. This dependence is shown
in fig. 4 for the BCDMS experiment. It is apparent that the error reduction is correlated with
the goodness of fit displayed in fig. 3, and it occurs during the GA training, thereby suggesting
that error reduction occurs when the neural networks start reproducing an underlying law. If
error reduction were due to parametrization bias it would be essentially independent of the
length of training.

The point-to-point correlation $\rho$ of the neural nets is somewhat larger than that of the data,
as one might expect as a consequence of an underlying law which is being learnt by the neural 
nets. In fact, for the NMC experiment the increase in correlation essentially compensates the 
reduction in error, in such a way that the average covariance of the nets and the data are 
essentially the same. This again shows that in the case of the NMC experiment a small number 
of points is sufficient to predict the remaining ones. For all other experiments, however, the 
covariance of the nets is substantially smaller than that of the data. As a consequence the 
(scatter) correlation of covariance remains relatively high for all experiments, except NMC, 
and especially E665 whose points are essentially predicted by the fit to other experiments. 

The structure function and associated one-$\sigma$ error band are compared to the data as a function 
of $x$ for a pair of typical values of $Q^2$ in fig. 5. In fig. 6 the behaviour of the structure function 
as a function of $x$ at fixed $Q^2$ and as a function of $Q^2$ at fixed $x$ is also shown. It is apparent 
that in the data region the error on the neural nets is rather smaller than that on the data used 
to train them. The error, however, grows rapidly as soon as the nets are extrapolated outside 
the region of the data. At large $x$, the extrapolation is nevertheless kept under control by the 
kinematic constraint $F_2(1,Q^2) = 0$. 

Figure 5: Final results for $F_2(x,Q^2)$ compared to data. For the neural net result, the one-$\sigma$ error 
band is shown. 

Let us now compare the determination of $F_2(x,Q^2)$ presented here with that of ref. [4], 
which was based on pre-HERA data. In fig. 7 the one-$\sigma$ error bands for the two parametrizations 
are compared, whereas in fig. 8 we display the relative pull of the two parametrizations, defined 
as 

$$ P(x,Q^2) \equiv \frac{F_2^{\mathrm{new}}(x,Q^2) - F_2^{\mathrm{old}}(x,Q^2)}{\sqrt{\sigma_{\mathrm{new}}^2(x,Q^2) + \sigma_{\mathrm{old}}^2(x,Q^2)}} \qquad (19) $$

where $\sigma(x,Q^2)$ is the error on the structure function determined as the variance of the neural 
network sample. In view of the fact that the old fit only included BCDMS and NMC data, it 
is interesting to consider four regions: (a) the BCDMS region (large $x$, intermediate $Q^2$, e.g. 
$x = 0.3$, $Q^2 = 20$ GeV$^2$); (b) the NMC region (intermediate $x$, not too large $Q^2$, e.g. $x = 0.1$, 
$Q^2 = 2$ GeV$^2$); (c) the HERA region (small $x$ and large $Q^2$, e.g. $x = 0.01$ and $Q^2 > 10$ GeV$^2$); 
(d) the region where neither the old nor the new fit had data (very large or very small $Q^2$). 
In region (a) the new fit is rather more precise than the old one, for reasons to be discussed 
shortly, while central values agree (with $P < 1$). In region (b) the new fit is significantly more 
precise than the old one, while central values agree to about one sigma. In region (c) the new 
fit is rather accurate while the old fit had large errors, but $P \gtrsim 1$ nevertheless, because the 
HERA rise of $F_2$ is outside the error bands extrapolated from NMC. This shows that even 
though errors on extrapolated data grow rapidly, they become unreliable when extrapolating 
far from the data. Finally, in region (d) all errors are very large and $P$ is consequently small, 
except at small $x$ and large $Q^2$, where the new fits extrapolate the rise in the HERA data, 
which is missing altogether in the old fits. 
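
The pull used in this comparison can be illustrated with a short numerical sketch. It assumes that the pull at a given kinematic point is the difference of the two central values in units of the two errors combined in quadrature, and that each error is the standard deviation over the corresponding Monte Carlo sample. The function name `pull` and the toy numbers are hypothetical and are not taken from either fit.

```python
import math
import statistics

def pull(sample_new, sample_old):
    """Pull between two fits at one kinematic point (x, Q^2).

    Each argument is a list with the values of F2 given by the
    members of one Monte Carlo sample at that point.
    """
    f_new = statistics.fmean(sample_new)
    f_old = statistics.fmean(sample_old)
    # Error on each fit: standard deviation over its sample.
    s_new = statistics.pstdev(sample_new)
    s_old = statistics.pstdev(sample_old)
    return (f_new - f_old) / math.sqrt(s_new**2 + s_old**2)

# Toy numbers (hypothetical, not taken from either fit).
new_fit = [0.40, 0.42, 0.41, 0.39, 0.43]
old_fit = [0.30, 0.50, 0.45, 0.35, 0.40]
p = pull(new_fit, old_fit)
print(f"P = {p:.3f}")
```

In this toy case the old sample is much broader than the new one, so the pull stays well below one even though the central values differ, which is the situation described for region (a).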

Let us finally come to the issue of the BCDMS error, which, as already mentioned, is reduced 
somewhat in the current fit in comparison to the data and the previous fit. This may appear 
surprising, in that the new fit does not contain any new data in the BCDMS region. However, 
as is apparent from fig. 4, this error reduction takes place in the GA training stage, when 
$E_3$, eq. (12), is minimized. Furthermore, we have verified that if the uncorrelated $\chi^2_{\mathrm{diag}} \equiv E_2$, 
eq. (11), is minimized during the GA training no error reduction is observed for BCDMS. Hence, 
we conclude that the reason why error reduction for BCDMS was not found in ref. [4] is that in 
that reference neural networks were trained by minimizing $E_2$. In fact, as discussed in sect. 4, 
the BCDMS experiment turns out to require the longest time to learn, especially after inclusion 
of the HERA data. Error reduction is obtained only after this lengthy minimization process. 

Figure 6: One-$\sigma$ error band for the structure function $F_2(x,Q^2)$ computed from neural nets. Note 
the different scale on the y axis in the two plots. 

6 Conclusion 

We have presented a determination of the probability density in the space of structure functions 
for the structure function $F_2(x,Q^2)$ of the proton, based on all available data from the NMC, 
BCDMS, E665, ZEUS and H1 collaborations. Our results take the form of a Monte Carlo sample 
of 1000 neural networks, each of which provides a determination of the structure function for 
all $(x,Q^2)$. The structure function and its statistical moments (errors, correlations and so on) 
can be determined by averaging over this sample. Results are made available as a FORTRAN 
routine which gives $F_2(x,Q^2)$, determined by a set of parameters, and 1000 sets of parameters 
corresponding to the Monte Carlo sample of structure functions. They can be downloaded from 
the web site http://sophia.ecm.ub.es/f2neural/ 
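
The averaging over the Monte Carlo sample can be sketched as follows. This is a minimal illustration in Python and not the distributed FORTRAN routine: the function `toy_f2` and its two parameters are hypothetical stand-ins for a trained network, and only ten parameter sets are generated instead of 1000.

```python
import math
import random

def toy_f2(params, x, q2):
    """Hypothetical stand-in for one network: a smooth toy function of (x, Q^2)."""
    a, b = params
    return a * (1.0 - x) ** 3 * x ** (-b) * (1.0 + 0.01 * math.log(q2))

def ensemble_moments(param_sets, point_1, point_2):
    """Central values, errors and correlation of F2 at two kinematic points."""
    f1 = [toy_f2(p, *point_1) for p in param_sets]
    f2 = [toy_f2(p, *point_2) for p in param_sets]
    n = len(param_sets)
    m1, m2 = sum(f1) / n, sum(f2) / n
    # Errors: square root of <F^2> - <F>^2 over the sample.
    s1 = math.sqrt(sum(v * v for v in f1) / n - m1 * m1)
    s2 = math.sqrt(sum(v * v for v in f2) / n - m2 * m2)
    rho = (sum(u * v for u, v in zip(f1, f2)) / n - m1 * m2) / (s1 * s2)
    return m1, s1, m2, s2, rho

random.seed(0)
# Toy sample of parameter sets (the real sample has 1000 sets).
sample = [(random.gauss(0.5, 0.05), random.gauss(0.2, 0.02)) for _ in range(10)]
m1, s1, m2, s2, rho = ensemble_moments(sample, (0.1, 2.0), (0.3, 20.0))
print(m1, s1, m2, s2, rho)
```

The same pattern applies to any other moment: evaluate every member of the sample at the points of interest, then average over the sample.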

This work updates and upgrades that of ref. [4], where similar results were obtained from 
the BCDMS and NMC data only. The main improvements in the present work are related to the 
need to handle a large number of experimental data, affected by large correlated systematics. 
Apart from many smaller technical aspects, the main improvement introduced here is the use 
of genetic algorithms to train neural networks on top of back-propagation. This has allowed 
for a more accurate handling of correlated systematics. 

Whereas the results of this paper may be of direct practical use for any application where an 
accurate determination of $F_2(x,Q^2)$ and its associated error are necessary, its main motivation 
is the development of a set of techniques which will be required for the construction of a full set 
of parton distributions with faithful uncertainty estimation based on the same method. This 
will be presented in a forthcoming publication. 

Figure 7: Comparison of the parametrization of $F_2(x,Q^2)$ of ref. [4] (old) with that of the present 
paper (new). The pairs of curves correspond to a one-$\sigma$ error band. 

A Statistical estimators 

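We define various statistical estimators which have been used in the paper. The superscripts 
(exp), (art) and (net) refer respectively to the original data, to the $N_{\mathrm{rep}}$ Monte Carlo replicas 
of the data, and to the $N_{\mathrm{rep}}$ neural networks. The subscripts rep and dat refer respectively to 
whether averages are taken by summing over all replicas or over all data. 

• Replica averages 

  - Average over the number of replicas for each experimental point $i$ 

$$ \left\langle F_i^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} = \frac{1}{N_{\mathrm{rep}}} \sum_{k=1}^{N_{\mathrm{rep}}} F_i^{(\mathrm{art})(k)} \qquad (20) $$

  - Associated variance 

$$ \sigma_i^{(\mathrm{art})} = \sqrt{ \left\langle \left( F_i^{(\mathrm{art})} \right)^2 \right\rangle_{\mathrm{rep}} - \left\langle F_i^{(\mathrm{art})} \right\rangle_{\mathrm{rep}}^2 } \qquad (21) $$

  - Associated covariance 

$$ \rho_{ij}^{(\mathrm{art})} = \frac{ \left\langle F_i^{(\mathrm{art})} F_j^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} - \left\langle F_i^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \left\langle F_j^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} }{ \sigma_i^{(\mathrm{art})} \sigma_j^{(\mathrm{art})} } \qquad (22) $$

$$ \mathrm{cov}_{ij}^{(\mathrm{art})} = \rho_{ij}^{(\mathrm{art})} \, \sigma_i^{(\mathrm{art})} \sigma_j^{(\mathrm{art})} \qquad (23) $$

  - Mean variance and percentage error on central values over the $N_{\mathrm{dat}}$ data points 

$$ \left\langle V\left[ \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \left( \left\langle F_i^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} - F_i^{(\mathrm{exp})} \right)^2 \qquad (24) $$

$$ \left\langle PE\left[ \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \left| \frac{ \left\langle F_i^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} - F_i^{(\mathrm{exp})} }{ F_i^{(\mathrm{exp})} } \right| \qquad (25) $$

    We define analogously $\left\langle V\left[ \langle \sigma^{(\mathrm{art})} \rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}}$, $\left\langle V\left[ \langle \rho^{(\mathrm{art})} \rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}}$, $\left\langle V\left[ \langle \mathrm{cov}^{(\mathrm{art})} \rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}}$ 
    and $\left\langle PE\left[ \langle \sigma^{(\mathrm{art})} \rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}}$. 

  - Scatter correlation: 

$$ r\left[ F^{(\mathrm{art})} \right] = \frac{ \left\langle F^{(\mathrm{exp})} \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right\rangle_{\mathrm{dat}} - \left\langle F^{(\mathrm{exp})} \right\rangle_{\mathrm{dat}} \left\langle \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right\rangle_{\mathrm{dat}} }{ \sigma_s^{(\mathrm{exp})} \sigma_s^{(\mathrm{art})} } \qquad (26) $$

    where the scatter variances are defined as 

$$ \sigma_s^{(\mathrm{exp})} = \sqrt{ \left\langle \left( F^{(\mathrm{exp})} \right)^2 \right\rangle_{\mathrm{dat}} - \left( \left\langle F^{(\mathrm{exp})} \right\rangle_{\mathrm{dat}} \right)^2 } \qquad (27) $$

$$ \sigma_s^{(\mathrm{art})} = \sqrt{ \left\langle \left( \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right)^2 \right\rangle_{\mathrm{dat}} - \left( \left\langle \left\langle F^{(\mathrm{art})} \right\rangle_{\mathrm{rep}} \right\rangle_{\mathrm{dat}} \right)^2 } \qquad (28) $$

    We define analogously $r\left[ \sigma^{(\mathrm{art})} \right]$, $r\left[ \rho^{(\mathrm{art})} \right]$ and $r\left[ \mathrm{cov}^{(\mathrm{art})} \right]$. Note that the scatter 
    correlation and scatter variance are not related to the variance and correlation, eqs. (21) and (22). 

  - Average variance: 

$$ \left\langle \sigma^{(\mathrm{art})} \right\rangle_{\mathrm{dat}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \sigma_i^{(\mathrm{art})} \qquad (29) $$

    We define analogously $\left\langle \rho^{(\mathrm{art})} \right\rangle_{\mathrm{dat}}$ and $\left\langle \mathrm{cov}^{(\mathrm{art})} \right\rangle_{\mathrm{dat}}$, as well as the corresponding 
    experimental quantities. 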
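
The replica estimators listed above translate into a few lines of code. The following is a minimal sketch in Python, under the assumption that the replicas are stored as a list of lists `replicas[k][i]` (replica k, data point i). All function names and the toy numbers are illustrative and are not part of the published routine.

```python
import math

def replica_mean(replicas):
    """<F_i>_rep for each data point i, eq. (20)."""
    n_rep = len(replicas)
    return [sum(col) / n_rep for col in zip(*replicas)]

def replica_sigma(replicas):
    """sigma_i = sqrt(<F_i^2>_rep - <F_i>_rep^2), eq. (21)."""
    n_rep = len(replicas)
    out = []
    for col in zip(*replicas):
        mean = sum(col) / n_rep
        mean_sq = sum(v * v for v in col) / n_rep
        # Guard against tiny negative values from rounding.
        out.append(math.sqrt(max(mean_sq - mean * mean, 0.0)))
    return out

def mean_variance(estimate, reference):
    """<V>_dat: mean squared deviation from the reference, eq. (24)."""
    return sum((a - e) ** 2 for a, e in zip(estimate, reference)) / len(reference)

def percentage_error(estimate, reference):
    """<PE>_dat: mean relative deviation from the reference, eq. (25)."""
    return sum(abs((a - e) / e) for a, e in zip(estimate, reference)) / len(reference)

def scatter_correlation(estimate, reference):
    """r: correlation over data points, eqs. (26)-(28)."""
    n = len(reference)
    m_a, m_e = sum(estimate) / n, sum(reference) / n
    s_a = math.sqrt(sum(a * a for a in estimate) / n - m_a ** 2)
    s_e = math.sqrt(sum(e * e for e in reference) / n - m_e ** 2)
    cross = sum(a * e for a, e in zip(estimate, reference)) / n
    return (cross - m_e * m_a) / (s_e * s_a)

# Toy input: 3 data points, 4 replicas (hypothetical numbers).
f_exp = [0.30, 0.40, 0.50]
replicas = [
    [0.31, 0.39, 0.52],
    [0.29, 0.41, 0.48],
    [0.30, 0.42, 0.50],
    [0.30, 0.38, 0.50],
]
f_art = replica_mean(replicas)
sigma_art = replica_sigma(replicas)
print(f_art, sigma_art)
print(mean_variance(f_art, f_exp), percentage_error(f_art, f_exp))
print(scatter_correlation(f_art, f_exp))
```

The toy replicas are chosen so that their average reproduces the central values, which is what a faithful replica sample should do as the number of replicas grows. The same three functions can be applied to errors, correlations or covariances by passing those quantities in place of the central values.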
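• Neural network averages 

  - Average over nets 

$$ \langle E \rangle = \frac{1}{N_{\mathrm{rep}}} \sum_{k=1}^{N_{\mathrm{rep}}} E_3^{(k)} \qquad (30) $$

    where $E_3^{(k)}$ is the $\chi^2$ given in eq. (12). 

  - Mean variance and percentage error on central values over the $N_{\mathrm{dat}}$ data points 

$$ \left\langle V\left[ \left\langle F^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \left( \left\langle F_i^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} - F_i^{(\mathrm{exp})} \right)^2 \qquad (31) $$

$$ \left\langle PE\left[ \left\langle F^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} \right] \right\rangle_{\mathrm{dat}} = \frac{1}{N_{\mathrm{dat}}} \sum_{i=1}^{N_{\mathrm{dat}}} \left| \frac{ \left\langle F_i^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} - F_i^{(\mathrm{exp})} }{ F_i^{(\mathrm{exp})} } \right| \qquad (32) $$

    We define analogously percentage errors on the correlation and covariance. 

  - Scatter correlation 

$$ r\left[ F^{(\mathrm{net})} \right] = \frac{ \left\langle F^{(\mathrm{exp})} \left\langle F^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} \right\rangle_{\mathrm{dat}} - \left\langle F^{(\mathrm{exp})} \right\rangle_{\mathrm{dat}} \left\langle \left\langle F^{(\mathrm{net})} \right\rangle_{\mathrm{rep}} \right\rangle_{\mathrm{dat}} }{ \sigma_s^{(\mathrm{exp})} \sigma_s^{(\mathrm{net})} } \qquad (33) $$

    We define analogously $\left\langle \rho^{(\mathrm{net})} \right\rangle_{\mathrm{dat}}$ and $\left\langle \mathrm{cov}^{(\mathrm{net})} \right\rangle_{\mathrm{dat}}$. 

  - $\mathcal{R}$-ratio 

$$ \mathcal{R} = \frac{ \langle \tilde{E} \rangle }{ \langle E \rangle } \qquad (34) $$

    where $\langle E \rangle$ is given by eq. (30) and 

$$ \langle \tilde{E} \rangle = \frac{1}{N_{\mathrm{rep}}} \sum_{k=1}^{N_{\mathrm{rep}}} \tilde{E}^{(k)} \qquad (35) $$

$$ \tilde{E}^{(k)} = \frac{1}{N_{\mathrm{dat}}} \sum_{i,j=1}^{N_{\mathrm{dat}}} \left( F_i^{(\mathrm{net})(k)} - F_i^{(\mathrm{exp})} \right) \mathrm{cov}^{-1}_{ij} \left( F_j^{(\mathrm{net})(k)} - F_j^{(\mathrm{exp})} \right) \qquad (36) $$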
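
The $\mathcal{R}$-ratio can be evaluated along the following lines. This is a minimal sketch in Python, not the code used in the analysis. It assumes that $\langle E \rangle$ compares each net to the replica it was trained on, and that $\langle \tilde{E} \rangle$ compares each net to the central experimental values, both weighted with the same covariance matrix. The helper names `solve`, `chi2` and `r_ratio` and the toy numbers are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (a[r][n] - tail) / a[r][r]
    return y

def chi2(prediction, target, cov):
    """(1/N_dat) * d^T cov^{-1} d with d = prediction - target."""
    d = [p - t for p, t in zip(prediction, target)]
    y = solve(cov, d)  # y = cov^{-1} d without forming the inverse
    return sum(di * yi for di, yi in zip(d, y)) / len(d)

def r_ratio(nets, replicas, f_exp, cov):
    """R = <E~> / <E>: nets vs. data over nets vs. their own replicas."""
    n_rep = len(nets)
    e_tilde = sum(chi2(net, f_exp, cov) for net in nets) / n_rep
    e = sum(chi2(net, art, cov) for net, art in zip(nets, replicas)) / n_rep
    return e_tilde / e

# Toy input: 2 data points, 2 replicas (hypothetical numbers).
cov = [[0.04, 0.01], [0.01, 0.09]]
f_exp = [1.0, 2.0]
replicas = [[1.2, 1.7], [0.8, 2.3]]
nets = [[1.1, 1.9], [0.9, 2.1]]
print(r_ratio(nets, replicas, f_exp, cov))
```

In the toy input each net sits closer to the central values than to the replica it is paired with, so the ratio comes out close to 0.5, the value expected in the absence of parametrization bias.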
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